perf(decimal): SIMD kernels for d64/d128/d256 and SUM(decimal) reduction #24257

Draft
aunjgr wants to merge 5 commits into matrixorigin:main from aunjgr:decimal-perf
aunjgr (Contributor) commented Apr 29, 2026

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #24097

What this PR does / why we need it:

Add a new pkg/common/simdkernels package providing SIMD-accelerated kernels for decimal arithmetic, gated by goexperiment.simd (AVX2 required, AVX-512 used opportunistically when available):

  • d64: add/sub (vector & broadcast, checked & unchecked), compare, multiply helper, scale (×10^k)
  • d128: add/sub (vector & broadcast, checked & unchecked), neg/abs, sign-extension from d64 (amd64 asm with prefetch; pure-Go fallbacks for arm64 and other archs)
  • d256: add/sub, neg/abs
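To make the d128 items concrete, here is a minimal scalar sketch of what a pure-Go fallback for sign-extension and checked addition could look like. The `Dec128` layout (low limb first), the function names, and the choice to stop at the first overflowing element are assumptions for illustration, not the PR's actual types or signatures. The only detail taken from the PR text is that checked kernels report an overflow index and use -1 for "no overflow".

```go
package main

import (
	"fmt"
	"math/bits"
)

// Dec128 is a hypothetical two-limb, two's-complement layout (low limb first).
type Dec128 struct{ Lo, Hi uint64 }

// SignExtend64 widens each signed 64-bit value to 128 bits.
// dst must be at least as long as src.
func SignExtend64(dst []Dec128, src []int64) {
	for i, v := range src {
		// v >> 63 is an arithmetic shift: 0 for non-negative v, all ones otherwise.
		dst[i] = Dec128{Lo: uint64(v), Hi: uint64(v >> 63)}
	}
}

// AddChecked128 stores a[i]+b[i] in dst[i] and returns the index of the first
// element whose signed 128-bit sum overflows, or -1 if none does.
// Assumes dst, a, and b have equal length.
func AddChecked128(dst, a, b []Dec128) int {
	for i := range a {
		lo, carry := bits.Add64(a[i].Lo, b[i].Lo, 0)
		hi, _ := bits.Add64(a[i].Hi, b[i].Hi, carry)
		dst[i] = Dec128{Lo: lo, Hi: hi}
		// Signed overflow: both operands share a sign and the result's sign differs.
		if ((a[i].Hi^hi)&(b[i].Hi^hi))>>63 != 0 {
			return i
		}
	}
	return -1
}

func main() {
	wide := make([]Dec128, 2)
	SignExtend64(wide, []int64{-1, 5})
	fmt.Printf("%+v\n", wide)

	a := []Dec128{{Lo: 1}, {Lo: ^uint64(0), Hi: 1<<63 - 1}} // second element is the max positive value
	b := []Dec128{{Lo: 2}, {Lo: 1}}
	out := make([]Dec128, 2)
	fmt.Println(AddChecked128(out, a, b))
}
```

A vectorized kernel would process several elements per instruction, but it should agree with a scalar reference like this one element for element.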

Wire the kernels into hot paths:

  • func_cast.go: d64 → d128 cast uses Decimal64SignExtend
  • arith_decimal_fast.go: d128 add/sub broadcast paths use the SIMD vector/scalar+vector/vector+scalar kernels
  • aggexec/sum_decimal_fast.go: SUM(decimal64) and SUM(decimal128) reduction uses the SIMD sum-reduce kernel for runs ≥ 32 elements
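The length-based dispatch in the SUM path can be sketched as follows. This is an illustration only: `Acc128`, `sumScalar`, `sumLanes`, and `SumDecimal64` are hypothetical names, and the four-accumulator loop merely stands in for a real SIMD kernel by mimicking four independent int64 lanes. Only the 32-element threshold comes from the PR description.

```go
package main

import (
	"fmt"
	"math/bits"
)

// simdMinRun mirrors the 32-element threshold stated in the PR description.
const simdMinRun = 32

// Acc128 is a hypothetical 128-bit two's-complement accumulator (low limb first).
type Acc128 struct{ Lo, Hi uint64 }

// addInt64 sign-extends v to 128 bits and adds it with carry propagation.
func (a *Acc128) addInt64(v int64) {
	var carry uint64
	a.Lo, carry = bits.Add64(a.Lo, uint64(v), 0)
	a.Hi, _ = bits.Add64(a.Hi, uint64(v>>63), carry)
}

// add accumulates another 128-bit partial sum.
func (a *Acc128) add(b Acc128) {
	var carry uint64
	a.Lo, carry = bits.Add64(a.Lo, b.Lo, 0)
	a.Hi, _ = bits.Add64(a.Hi, b.Hi, carry)
}

// sumScalar is the plain loop used for short runs.
func sumScalar(vals []int64) Acc128 {
	var acc Acc128
	for _, v := range vals {
		acc.addInt64(v)
	}
	return acc
}

// sumLanes stands in for the SIMD kernel: four independent partial sums,
// a scalar tail for the leftover elements, then a final horizontal merge.
func sumLanes(vals []int64) Acc128 {
	var lanes [4]Acc128
	n := len(vals) &^ 3 // largest multiple of 4 not exceeding len(vals)
	for i := 0; i < n; i += 4 {
		for l := 0; l < 4; l++ {
			lanes[l].addInt64(vals[i+l])
		}
	}
	total := sumScalar(vals[n:])
	for _, l := range lanes {
		total.add(l)
	}
	return total
}

// SumDecimal64 picks the lane-based path only for runs of at least simdMinRun elements.
func SumDecimal64(vals []int64) Acc128 {
	if len(vals) >= simdMinRun {
		return sumLanes(vals)
	}
	return sumScalar(vals)
}

func main() {
	vals := make([]int64, 100)
	for i := range vals {
		vals[i] = int64(i + 1)
	}
	fmt.Printf("%+v\n", SumDecimal64(vals))
}
```

A threshold like this is typically there because short runs do not amortize the setup and horizontal-merge cost of the wide path; the sketch keeps both paths so they can be checked against each other.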

Makefile now passes GOEXPERIMENT=simd to go build so the kernels are enabled by default.

TPC-H SF100 wall-time wins (Zen 3, 24-core, median of 5):

| Query | Baseline | This change | Δ      |
|-------|----------|-------------|--------|
| Q1    | 12.78s   | 11.98s      | -6.3%  |
| Q5    |  4.13s   |  3.43s      | -16.9% |
| Q9    | 12.51s   | 10.70s      | -14.4% |
| Q14   |  2.67s   |  2.26s      | -15.4% |

Add a new `pkg/common/simdkernels` package providing SIMD-accelerated
kernels for decimal arithmetic, gated by `goexperiment.simd` (AVX2
required, AVX-512 used opportunistically when available):

- d64: add/sub (vector & broadcast, checked & unchecked), compare,
  multiply helper, scale (×10^k)
- d128: add/sub (vector & broadcast, checked & unchecked), neg/abs,
  sign-extension from d64 (amd64 asm with prefetch; pure-Go fallbacks
  for arm64 and other archs)
- d256: add/sub, neg/abs

Wire the kernels into hot paths:

- `func_cast.go`: d64 → d128 cast uses `Decimal64SignExtend`
- `arith_decimal_fast.go`: d128 add/sub broadcast paths use the SIMD
  vector/scalar+vector/vector+scalar kernels
- `aggexec/sum_decimal_fast.go`: SUM(decimal64) and SUM(decimal128)
  reduction uses the SIMD sum-reduce kernel for runs ≥ 32 elements

`Makefile` now passes `GOEXPERIMENT=simd` to `go build` so the kernels
are enabled by default.

TPC-H SF100 wall-time wins (Zen 3, 24-core, median of 5):

| Query | Baseline | This change | Δ      |
|-------|----------|-------------|--------|
| Q1    | 12.78s   | 11.98s      | -6.3%  |
| Q5    |  4.13s   |  3.43s      | -16.9% |
| Q9    | 12.51s   | 10.70s      | -14.4% |
| Q14   |  2.67s   |  2.26s      | -15.4% |

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Int64x8.AndNot has inverted operand semantics compared to Int64x4.AndNot
on Go 1.26.2 (VPANDNQ computes ~receiver & arg rather than receiver & ~arg).
This caused all AVX-512 checked-add overflow detection to silently miss
overflows, returning -1 instead of the overflow index.

Fix: swap operands in all 6 AVX-512 AddChecked functions (d64, d128, d256
vector and scalar-broadcast variants).

Also replace custom itoa() with strconv.Itoa to fix build without
GOEXPERIMENT=simd (d64_compare_test.go referenced itoa from a
build-tagged file).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
aunjgr and others added 3 commits May 5, 2026 01:01
# Conflicts:
#	pkg/sql/colexec/aggexec/sum_decimal_fast.go
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Labels

kind/enhancement, size/XXL (denotes a PR that changes 2000+ lines)

3 participants